Introduction

BigData, Cloud and OpenSource

It is now well known that a new buzzword is emerging on the Internet, perhaps replacing, or at least joining, the old “Cloud Computing”.

Big Data.

The two terms, both subjected to a media bombardment on the Internet, are also very closely linked to each other, as we want to demonstrate below. Since we also want to show how the BigData world is deeply linked to the OpenSource world, see, for example, the link between Cloud Computing and OpenSource in our old note.

Definition

Let’s start with some definitions of BigData, to try to curb the usual sensationalist journalistic errors, or the intentional ones made for marketing purposes, which we have grown abundantly accustomed to with Cloud Computing. Is there an official definition?

We quote WikiPedia, Gartner, IBM and Villanova University in Tampa (Florida), and further on, NIST:

WikiPedia EN

WikiPedia IT

Gartner – taken from the glossary

IBM – and I also suggest their 4Vs infographic

Villanova University

Therefore, everyone seems to agree on defining big data as “a collection of data so large and complex that it requires different tools from traditional ones to be analyzed and visualized”. Then the differences begin:

Everyone agrees that “the data would potentially come from heterogeneous sources”, but here some argue that it is all “structured data”, while others also include “unstructured data”.

Let’s come to the size that data must reach to be called BigData. Here, obviously, there is disagreement, and the English Wikipedia rightly argues that the BigData threshold is constantly moving; it could not be otherwise, considering the many studies that analyze, year after year, the growth of data produced worldwide. In 2012 the talk was of a range from tens of terabytes to several petabytes per dataset, while now we are talking about zettabytes (billions of terabytes).

On the merits, we cite the provocative article by Marco Russo, sent to Luca De Biase and published by him on his blog.

Everyone agrees on the 3 Vs that characterize Big Data: Volume, Velocity and Variety.

And some speak of a 4th V: Veracity, as in IBM’s 4Vs infographic mentioned above.

But what is NIST doing about the definition of Big Data? We know that NIST moves slowly and cumbersomely; we learned this from the many months, or rather years, during which the definition of Cloud Computing remained permanently in draft, even though work on it had started back in 2008.

Well, NIST began to move when the U.S. government decided to allocate $200 million to the BigData Initiative: thus the NIST BigData Workshop and a Working Group open to everyone were started, as had been done for the definition and all the documents related to the term Cloud Computing.

Ecosystem

To show the size of the global ecosystem that revolves around this term, let’s look at three infographics from Bloomberg, Forbes and Capgemini respectively.

Bloomberg
Forbes
Capgemini

Already from these three infographics it is evident how massively OpenSource solutions are used in the BigData ecosystem; Forbes even lists only OpenSource software among the technologies.

Market size

Let’s take a look at the market and the growth around this BigData ecosystem.

According to Gartner (2012 data), “Big Data Will Drive $28 Billion of IT Spending” and “Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support Big Data By 2015”.

And now let’s enjoy these two infographics, one from Asigra and one from IBM, which is very active in the BigData world:

In short, the Big Data market basically requires a few things:

Opportunity

BigData, in my opinion, is a great opportunity for large HW and SW IT companies (IBM, HP, EMC, Oracle, etc.), as it reawakens companies’ appetite for buying HW rather than using the Public Cloud. There is also a growing need for simple, dedicated and customized SW for Data Analysis. Of course, in many cases the data could be kept and processed at Cloud Providers, and this is what market leaders such as AWS have allowed for some time now with DynamoDB, RedShift and Elastic MapReduce; but keeping petabytes or zettabytes (if these are the values we must refer to in order to speak of BigData) in the Cloud costs a lot, and I think it may even be more convenient to maintain your own infrastructure. It is different if we have a few terabytes of data on which we want to do Data Analysis, and I think this is the most common scenario, where the services of a Public Cloud like AWS become truly competitive, as the back-of-the-envelope sketch below suggests.
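To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. The flat per-GB rate is an assumption invented for illustration (real object-storage pricing is tiered and changes over time), so treat the output as orders of magnitude, not a quote:

```python
# Back-of-the-envelope only: a hypothetical flat storage rate, not a real price list.
ASSUMED_USD_PER_GB_MONTH = 0.03

def monthly_storage_cost_usd(terabytes):
    # 1 TB = 1024 GB; cost scales linearly under the flat-rate assumption.
    return terabytes * 1024 * ASSUMED_USD_PER_GB_MONTH

# A few TB (the common case), one petabyte, one exabyte.
for tb in (5, 1024, 1024 ** 2):
    print('{:>12,} TB -> about ${:,.0f} per month'.format(tb, monthly_storage_cost_usd(tb)))
```

Even under this crude assumption, a few terabytes cost pocket change, while a petabyte already runs into tens of thousands of dollars per month; that is the scale at which owning your own infrastructure starts to be worth discussing.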

Recently, the big IT companies have opened up many opportunities for companies, startups and the research world around Big Data: for example, EMC announced the Hadoop Starter Kit 2.0, Microsoft offers Hadoop in the Azure cloud, SAS allied with SAP on the HANA platform, SAP HANA is also available on demand in AWS, and Intel and AWS offer free trials. In short, there is something for everyone; it is a real explosion for the IT economy.

Open Source and Cloud Computing

On BigData and Cloud Computing in practice we have already answered: the possibilities are many. We have mentioned the undisputed leader (AWS) and Azure among the Public Cloud offerings, but Google also has useful tools (BigQuery); after all, just remember Google’s famous and now old BigTable, which is used for their search engine.
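As a taste of this kind of tool, here is a minimal BigQuery sketch in Python. It uses the google-cloud-bigquery client library (more recent than the services named above) against one of Google’s public sample datasets; an authenticated Google Cloud project is assumed:

```python
# Minimal BigQuery query from Python; assumes the google-cloud-bigquery package
# is installed and credentials/project are configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate a public sample dataset: no cluster or storage of our own to manage.
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.word, row.total)
```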

The Public Cloud, even in the case of Big Data, can be very useful and very democratic (if we leave aside the dataset sizes the definitions would require). Think of the simplicity of not having to manage storage systems, backups and disaster recovery; of not having to manage Data Analysis SW (if we use some PaaS or SaaS solution); of being able to keep little active power during periods of non-analysis (paying little) and of being able to instantiate computing power only during our queries, as in the sketch below.
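To make “computing power only during our queries” concrete, here is a minimal sketch with the boto library (the Python AWS SDK of this article’s era). The bucket and script paths are made up for the example; the key detail is keep_alive=False, which makes Elastic MapReduce terminate the cluster as soon as the job finishes, so you pay only for the analysis itself:

```python
# A transient Elastic MapReduce cluster with boto 2.x: the cluster lives only
# for the duration of the job. Bucket and script paths are hypothetical.
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region('us-east-1')  # credentials come from the environment

step = StreamingStep(
    name='Nightly analysis',
    mapper='s3://example-bucket/scripts/mapper.py',
    reducer='s3://example-bucket/scripts/reducer.py',
    input='s3://example-bucket/input/',
    output='s3://example-bucket/output/',
)

jobflow_id = conn.run_jobflow(
    name='Transient analysis cluster',
    steps=[step],
    num_instances=3,
    master_instance_type='m1.medium',
    slave_instance_type='m1.medium',
    keep_alive=False,  # shut the cluster down when the steps are done
)
print('Started job flow:', jobflow_id)
```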

Now we come to BigData and OpenSource. As we have seen so far, one name resonates strongly in all the scenarios mentioned: HADOOP.

Hadoop is an open-source software framework (Apache 2.0 license) for storing and processing large amounts of data on clusters of commodity hardware. It was started in 2005 by Doug Cutting and Mike Cafarella within the Nutch search engine project (a would-be competitor of Google’s), as an open implementation of the ideas in Google’s MapReduce and Google File System papers (BigTable’s open counterpart, HBase, came later).
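To show the programming model in practice, here is the classic word count written for Hadoop Streaming, which lets mappers and reducers be plain scripts in any language (Python here). The invocation in the header comment is illustrative; jar and HDFS paths vary by installation:

```python
#!/usr/bin/env python
# wordcount_streaming.py -- run as: wordcount_streaming.py map|reduce
# Illustrative Hadoop Streaming invocation (jar/paths vary by installation):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/in -output /data/out \
#     -mapper 'wordcount_streaming.py map' \
#     -reducer 'wordcount_streaming.py reduce' \
#     -file wordcount_streaming.py
import sys

def do_map(stream):
    # Emit a (word, 1) pair per token; Hadoop sorts by key between the phases.
    for line in stream:
        for word in line.strip().split():
            print('%s\t1' % word)

def do_reduce(stream):
    # Keys arrive grouped after the shuffle, so one running counter suffices.
    current, count = None, 0
    for line in stream:
        word, value = line.rstrip('\n').split('\t', 1)
        if word != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

if __name__ == '__main__':
    (do_map if sys.argv[1] == 'map' else do_reduce)(sys.stdin)
```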

Many distributed computing and distributed storage solutions have grown out of this project. Hadoop also has many child projects, such as:

HBase
Hive
Pig
ZooKeeper
Mahout

to mention the most well-known in the Hadoop world.

But open source at the service of Big Data doesn’t stop there.

We’ll stop here for now, but we’ll keep updating the article.

 
